AITopics | training system

Skrull: Towards Efficient Long Context Fine-tuning through Dynamic Data Scheduling

Neural Information Processing Systems, Jun-13-2026, 17:23:36 GMT

Long-context supervised fine-tuning (Long-SFT) plays a vital role in enhancing the performance of large language models (LLMs) on long-context tasks. To smoothly adapt LLMs to long-context scenarios, this process typically entails training on mixed datasets containing both long and short sequences. However, this heterogeneous sequence length distribution poses significant challenges for existing training systems, as they fail to simultaneously achieve high training efficiency for both long and short sequences, resulting in sub-optimal end-to-end system performance in Long-SFT. In this paper, we present a novel perspective on data scheduling to address the challenges posed by the heterogeneous data distributions in Long-SFT. We propose Skrull, a dynamic data scheduler specifically designed for efficient Long-SFT.
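To make the mixed-length problem concrete, here is a minimal sketch of length-aware batching under a fixed token budget. It is not Skrull's scheduling algorithm, which the abstract does not describe; it swaps in a plain first-fit-decreasing heuristic. The function name `pack_by_token_budget`, the 8192-token budget, and the sequence lengths are all assumptions made for illustration.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], budget: int) -> List[List[int]]:
    """Group sequence lengths into micro-batches of at most `budget` tokens.

    Uses first-fit decreasing: the longest sequences are placed first, and
    each sequence goes into the first micro-batch that still has room.
    This is a generic heuristic, not the scheduler proposed in the paper.
    """
    batches: List[List[int]] = []
    loads: List[int] = []  # running token count per micro-batch
    for n in sorted(lengths, reverse=True):
        if n > budget:
            raise ValueError(f"sequence of {n} tokens exceeds budget {budget}")
        for i, load in enumerate(loads):
            if load + n <= budget:
                batches[i].append(n)
                loads[i] += n
                break
        else:
            # No existing micro-batch has room, so open a new one.
            batches.append([n])
            loads.append(n)
    return batches


# Hypothetical mix of long and short sequence lengths (in tokens).
example_lengths = [8000, 500, 300, 7000, 200, 1000]
example_batches = pack_by_token_budget(example_lengths, budget=8192)
print(example_batches)
```

With these made-up lengths, the single longest sequence occupies a micro-batch almost alone while the short ones share another, which is the kind of imbalance a scheduler for mixed long and short data has to manage.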

artificial intelligence, large language model, natural language, (14 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.85)

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

Neural Information Processing Systems, Dec-24-2025, 18:31:57 GMT

There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, Sample Factory, and others aim to improve the system's overall throughput. In this paper, we aim to address a common bottleneck in the RL training system, i.e., parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for parallelizing RL environments, we have improved the RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation to a high-end machine such as NVIDIA DGX-A100. On a high-end machine, EnvPool achieves one million frames per second for the environment execution on Atari environments and three million frames per second on MuJoCo environments. When running EnvPool on a laptop, the speed is 2.8x that of the Python subprocess. Moreover, great compatibility with existing RL training libraries has been demonstrated in the open-source community, including CleanRL, rl_games, DeepMind Acme, etc. Finally, EnvPool allows researchers to iterate on their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that it only takes five minutes to train agents to play Atari Pong and MuJoCo Ant on a laptop. EnvPool is open-sourced at https://github.com/sail-sg/envpool.

envpool, name change, reinforcement learning environment execution engine, (4 more...)

Neural Information Processing Systems

Industry: Education (0.43)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.78)

RL in the Wild: Characterizing RLVR Training in LLM Deployment

Zhou, Jiecheng, Hu, Qinghao, Jin, Yuyang, Wang, Zerui, Sun, Peng, Gu, Yuzhe, Zhang, Wenwei, Zhai, Mingshu, Zhang, Xingcheng, Zhang, Weiming

arXiv.org Artificial Intelligence, Oct-14-2025

Large Language Models (LLMs) are now widely used across many domains. With their rapid development, Reinforcement Learning with Verifiable Rewards (RLVR) has surged in recent months to enhance their reasoning and understanding abilities. However, its complex data flows and diverse tasks pose substantial challenges to RL training systems, and there is limited understanding of RLVR from a system perspective. To thoroughly understand the system challenges introduced by RLVR, we present a characterization study of RLVR tasks in our LLM deployment. Specifically, we investigate the distribution and variation trends of workloads across different RL tasks and across training steps. We identify issues such as GPU idling caused by skewed sequence length distribution, inefficient parallel strategies in dynamically varying workloads, inefficient data management mechanisms, and load imbalance. We describe our observations and call for further investigation into the remaining open challenges. Furthermore, we propose the PolyTrace benchmark suite to conduct evaluation with realistic workloads; a practical use case validates that the PolyTrace benchmark suite exhibits 94.7% accuracy.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2509.25279

Country:

Asia (0.46)
North America > United States (0.46)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Probing Experts' Perspectives on AI-Assisted Public Speaking Training

Fourati, Nesrine, Barkar, Alisa, Dragée, Marion, Danthon-Lefebvre, Liv, Chollet, Mathieu

arXiv.org Artificial Intelligence, Jul-14-2025

Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.

artificial intelligence, natural language, trainee, (19 more...)

arXiv.org Artificial Intelligence

2507.0793

Country:

Europe (0.68)
North America > United States (0.67)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (1.00)
Instructional Material (1.00)
Personal > Interview (0.66)

Industry:

Health & Medicine > Therapeutic Area > Psychiatry/Psychology > Mental Health (0.46)
Education > Educational Technology > Educational Software > Computer Based Training (0.46)
Education > Educational Setting > Online (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Human Computer Interaction > Interfaces > Virtual Reality (0.67)

EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine

Neural Information Processing Systems, Jan-17-2025, 15:45:37 GMT

envpool, reinforcement learning environment execution engine, training system, (1 more...)

Neural Information Processing Systems

Industry: Education (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.64)

Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

Lu, Ning, Xie, Qian, Zhang, Hao, Fang, Wenyi, Zheng, Yang, Hu, Zheng, Ma, Jiantao

arXiv.org Artificial Intelligence, Sep-5-2024

Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that significantly increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called Training Overhead Ratio (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
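The definition above (optimal training time divided by observed training time) can be written out directly. The sketch below is only an illustration of that ratio and of using it to estimate wall-clock time; the function names and all the hour figures are hypothetical, and the paper's failure-specific TOR equations are not reproduced here.

```python
def training_overhead_ratio(optimal_hours: float, observed_hours: float) -> float:
    """TOR as stated in the abstract: optimal time / observed time.

    A value of 1.0 means no overhead; lower values mean more time was
    lost (for example to failures and recovery).
    """
    if optimal_hours <= 0 or observed_hours <= 0:
        raise ValueError("training times must be positive")
    return optimal_hours / observed_hours


def estimated_wall_clock_hours(optimal_hours: float, tor: float) -> float:
    """Invert the ratio to estimate actual time for a new job on the same system."""
    if not 0 < tor <= 1:
        raise ValueError("TOR is expected to lie in (0, 1]")
    return optimal_hours / tor


# Hypothetical numbers: a job that would ideally take 100 hours took 125.
tor = training_overhead_ratio(optimal_hours=100.0, observed_hours=125.0)

# Assuming the same TOR holds, estimate a job with a 200-hour ideal time.
estimate = estimated_wall_clock_hours(optimal_hours=200.0, tor=tor)
print(f"TOR = {tor:.2f}, estimated wall-clock = {estimate:.0f} hours")
```

Treating TOR as constant across jobs is itself an assumption; in practice it would depend on cluster size, failure rate, and recovery cost.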

reliability, training system, training time, (11 more...)

arXiv.org Artificial Intelligence

2408.07482

Country: Asia > China > Hong Kong (0.05)

Genre: Research Report (0.40)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management

Wang, Yujie, Zhu, Shenhan, Fu, Fangcheng, Miao, Xupeng, Zhang, Jie, Zhu, Juan, Hong, Fan, Li, Yong, Cui, Bin

arXiv.org Artificial Intelligence, Sep-5-2024

Recent foundation models are capable of handling multiple machine learning (ML) tasks and multiple data modalities with a unified base model structure and several specialized model components. However, the development of such multi-task (MT) multi-modal (MM) models poses significant model management challenges to existing training systems. Due to the sophisticated model architecture and the heterogeneous workloads of different ML tasks and data modalities, training these models usually requires massive GPU resources and suffers from sub-optimal system efficiency. In this paper, we investigate how to achieve high-performance training of large-scale MT MM models through data heterogeneity-aware model management optimization. The key idea is to decompose the model execution into stages and address the joint optimization problem sequentially, including both heterogeneity-aware workload parallelization and dependency-driven execution scheduling. Based on this, we build a prototype system and evaluate it on various large MT MM models. Experiments demonstrate the superior performance and efficiency of our system, with a speedup ratio of up to 71% compared to state-of-the-art training systems.

metaop, spindle, workload, (16 more...)

arXiv.org Artificial Intelligence

2409.03365

Country:

North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
(11 more...)

Genre: Research Report (0.82)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Multi-Agent Training for Pommerman: Curriculum Learning and Population-based Self-Play Approach

Huynh, Nhat-Minh, Cao, Hoang-Giang, Wu, I-Chen

arXiv.org Artificial Intelligence, Jun-30-2024

Pommerman is a multi-agent environment that has received considerable attention from researchers in recent years. This environment is an ideal benchmark for multi-agent training, providing a battleground for two teams with communication capabilities among allied agents. Pommerman presents significant challenges for model-free reinforcement learning due to delayed action effects, sparse rewards, and false positives, where opponent players can lose due to their own mistakes. This study introduces a system designed to train multi-agent systems to play Pommerman using a combination of curriculum learning and population-based self-play. We also tackle two challenging problems when deploying the multi-agent training system for competitive games: sparse rewards and the need for a suitable matchmaking mechanism. Specifically, we propose an adaptive annealing factor based on agents' performance to dynamically adjust the dense exploration reward during training. Additionally, we implement a matchmaking mechanism utilizing the Elo rating system to pair agents effectively. Our experimental results demonstrate that our trained agent can outperform top learning agents without requiring communication among allied agents.
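For readers unfamiliar with Elo-based matchmaking, the following is a minimal sketch using the standard Elo expected-score and update formulas. The abstract does not give the authors' pairing rule or constants, so the K-factor of 32, the closest-rating pairing in `closest_opponent`, and the agent names are assumptions for illustration only.

```python
from typing import Dict, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k = 32 is a common default, assumed here rather than taken from the paper.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def closest_opponent(agent: str, ratings: Dict[str, float]) -> str:
    """Pick the other agent whose rating is nearest (an assumed pairing rule)."""
    others = (name for name in ratings if name != agent)
    return min(others, key=lambda name: abs(ratings[name] - ratings[agent]))


# Hypothetical population of agents and ratings.
population = {"agent_a": 1500.0, "agent_b": 1520.0, "agent_c": 1700.0}
opponent = closest_opponent("agent_a", population)
new_a, new_b = update_ratings(1500.0, 1500.0, score_a=1.0)
print(opponent, new_a, new_b)
```

Pairing by nearest rating keeps matches competitive, which is one common reason to use Elo in population-based self-play; other pairing rules are equally possible.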

agent, pommerman, training agent, (15 more...)

arXiv.org Artificial Intelligence

2407.00662

Country:

Asia > Taiwan (0.04)
North America > United States > New York (0.04)
Asia > Thailand (0.04)
Asia > South Korea (0.04)

Genre: Research Report > New Finding (0.66)

Industry:

Leisure & Entertainment > Games > Computer Games (0.46)
Leisure & Entertainment > Games > Chess (0.38)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)

Graph Neural Network Training Systems: A Performance Comparison of Full-Graph and Mini-Batch

Bajaj, Saurabh, Guan, Hui, Serafini, Marco

arXiv.org Artificial Intelligence, Jun-8-2024

Graph Neural Networks (GNNs) have gained significant attention in recent years due to their ability to learn representations of graph-structured data. Two common methods for training GNNs are mini-batch training and full-graph training. Since these two methods require different training pipelines and systems optimizations, two separate categories of GNN training systems emerged, each tailored for one method. Works that introduce systems belonging to a particular category predominantly compare them with other systems within the same category, offering limited or no comparison with systems from the other category. Some prior work also justifies its focus on one specific training method by arguing that it achieves higher accuracy than the alternative. The literature, however, has incomplete and contradictory evidence in this regard. In this paper, we provide a comprehensive empirical comparison of full-graph and mini-batch GNN training systems to get a clearer picture of the state of the art in the field. We find that the mini-batch training systems we consider consistently converge faster than the full-graph training ones across multiple datasets, GNN models, and system configurations, with speedups between 2.4x and 15.2x. We also find that both training techniques converge to similar accuracy values, so comparing systems across the two categories in terms of time-to-accuracy is a sound approach.
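The time-to-accuracy comparison mentioned above can be sketched as follows: take each system's training log, find the first point at which it reaches a target accuracy, and divide the two times. The log format, the function name `time_to_accuracy`, and every number below are made up for illustration and are not results from the paper.

```python
from typing import Optional, Sequence, Tuple

# Each log entry is (elapsed_seconds, validation_accuracy), in time order.
TrainingLog = Sequence[Tuple[float, float]]


def time_to_accuracy(log: TrainingLog, target: float) -> Optional[float]:
    """Return the first elapsed time at which accuracy reaches `target`.

    Returns None if the run never reaches the target.
    """
    for elapsed, accuracy in log:
        if accuracy >= target:
            return elapsed
    return None


# Hypothetical logs for two systems that converge to the same accuracy.
mini_batch_log = [(10.0, 0.60), (20.0, 0.72), (30.0, 0.75)]
full_graph_log = [(50.0, 0.55), (100.0, 0.70), (150.0, 0.75)]

target_accuracy = 0.75
t_mini = time_to_accuracy(mini_batch_log, target_accuracy)
t_full = time_to_accuracy(full_graph_log, target_accuracy)

if t_mini is not None and t_full is not None:
    speedup = t_full / t_mini
    print(f"time-to-accuracy speedup: {speedup:.1f}x")
```

The metric is only meaningful when both runs can reach the same target, which is why the finding that the two techniques converge to similar accuracy matters for the comparison.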

accuracy, hyperparameter, training system, (15 more...)

arXiv.org Artificial Intelligence

2406.00552

Country:

North America > United States > Massachusetts > Hampshire County > Amherst (0.14)
Europe > Greece (0.04)
Asia > China > Shanghai > Shanghai (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

HetHub: A Heterogeneous distributed hybrid training system for large-scale models

Xu, Si, Huang, Zixiao, Zeng, Yan, Yan, Shengen, Ning, Xuefei, Ye, Haolin, Gu, Sipei, Shui, Chunsheng, Lin, Zhezheng, Zhang, Hao, Wang, Sheng, Dai, Guohao, Wang, Yu

arXiv.org Artificial Intelligence, May-25-2024

The development of large-scale models relies on a vast amount of computing resources. For example, the GPT-4 model (1.8 trillion parameters) requires 25,000 A100 GPUs for its training. It is a challenge to build a large-scale cluster with a single type of GPU-accelerator. Using multiple types of GPU-accelerators to construct a cluster is an effective way to solve the problem of insufficient homogeneous GPU-accelerators. However, the existing distributed training systems for large-scale models only support homogeneous GPU-accelerators, not heterogeneous GPU-accelerators. To address the problem, this paper proposes a distributed training system with hybrid parallelism support on heterogeneous GPU-accelerators for large-scale models. It introduces a distributed unified communicator to realize the communication between heterogeneous GPU-accelerators, a distributed performance predictor, and an automatic hybrid parallel module to develop and train models efficiently with heterogeneous GPU-accelerators. Compared to the distributed training system with homogeneous GPU-accelerators, our system can support six different combinations of heterogeneous GPU-accelerators, and the optimal performance of heterogeneous GPU-accelerators achieves at least 90% of the theoretical upper bound performance of homogeneous GPU-accelerators.

gpu-accelerator, parallelism, parallelism strategy, (16 more...)

arXiv.org Artificial Intelligence

2405.16256

Country:

Europe > Portugal > Lisbon > Lisbon (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
Europe > Czechia > Prague (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology:

Information Technology > Hardware (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.91)
